Read control in a computer I/O interconnect

ABSTRACT

In one embodiment, a method for controlling reads in a computer input/output (I/O) interconnect is provided. A read request is received over the computer I/O interconnect from a first device, the request requesting data of a first size. Then it is determined whether fulfilling the read request would cause the total size of a completion queue to exceed a first predefined threshold. If fulfilling the read request would cause the total size of the completion queue to exceed the first predefined threshold, then the read request is temporarily restricted from being forwarded upstream.

CROSS REFERENCE TO RELATED APPLICATIONS

This patent application takes priority under 35 U.S.C. 119(e) to (i) U.S. Provisional Patent Application No. 61/014,685, filed on Dec. 18, 2007 (Attorney Docket No. PLXTP001P), entitled "PLX ARCHITECTURE", by George Apostol, and (ii) U.S. Provisional Patent Application No. 61/015,613, filed on Dec. 20, 2007 (Attorney Docket No. PLXTP002P), entitled "PLX SOFTWARE DEVELOPMENT KIT", by George Apostol, each of which is incorporated by reference in its entirety for all purposes.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to computer I/O interconnects. More particularly, the present invention relates to read control in a computer I/O interconnect.

2. Description of the Related Art

In a computer architecture, a bus is a subsystem that transfers data between computer components inside a computer or between computers. Unlike a point-to-point connection (a different type of computer input/output (I/O) interconnect), a bus can logically connect several peripherals over the same set of wires. Each bus defines its set of connectors for physically plugging devices, cards, or cables together.

There are many different computer I/O interconnect standards available. One of the most popular over the years has been the peripheral component interconnect (PCI) standard. PCI allows the bus to act like a bridge, which isolates a local processor bus from the peripherals, allowing a Central Processing Unit (CPU) of the computer to run much faster.

Recently, a successor to PCI, termed PCI Express (or, simply, PCIe), has been popularized. PCIe provides higher performance, increased flexibility, and scalability for next-generation systems, while maintaining software compatibility with existing PCI applications. Compared to legacy PCI, the PCI Express protocol is considerably more complex, with three layers: the transaction, data link, and physical layers.

In a PCI Express system, a root complex device connects the processor and memory subsystem to the PCI Express switch fabric comprised of one or more switch devices (embodiments are also possible without switches, however). In PCI Express, a point-to-point architecture is used. Similar to a host bridge in a PCI system, the root complex generates transaction requests on behalf of the processor, which is interconnected through a local I/O interconnect. Root complex functionality may be implemented as a discrete device, or may be integrated with the processor. A root complex may contain more than one PCI Express port, and multiple switch devices can be connected to ports on the root complex or cascaded.

PCI Express also supports split read completions. This means that the completion of a read request initiated at a particular time may not be performed until a later time. Essentially, the read request must wait in a queue until it is serviced. Since a request is typically only 12-20 bytes, whereas the size of a completion response can range up to 4096 bytes, there is a natural imbalance where requests can accumulate faster than data can be returned.

This relative size imbalance between requests and completion data responses can negatively affect performance if too many requests are active at one time. This is especially true in a typical PCIe system where multiple downstream devices all try to read from a single root complex, and wherein the root complex typically services the read requests in a first-come-first-served fashion. If the requests are for large amounts of data, a long read request queue can develop in the root complex as it services the requests. This long queue can be exacerbated if the final data destination (the source of the read request) has less bandwidth than the data supplier (the request destination), which is common in host-centric PCIe systems, where the link closest to the root complex is typically the widest. Once intermediary buffers are filled, the bandwidth of the root complex effectively reduces to the bandwidth of the data sink.

If a new downstream device sends its first read request into this long queue of requests in the destination, the new read request will wait for the entire read request queue ahead of it to drain before it will get serviced. The long wait time for a response can dramatically impact performance.

For example, suppose a PCIe switch connects a single ×8 upstream port to two ×4 downstream ports. One downstream port has a FibreChannel RAM disk that is capable of sending 16 1024-byte memory read requests at a time. The other downstream port is a dual Gigabit Ethernet controller that can send 2 read requests at a time (1 per channel), with the read size being either 16 bytes (for a descriptor) or 1500 bytes (for an Ethernet packet). The root complex sends 64-byte completions, so a 1024-byte read request would result in 16 partial completions.

By itself, the Ethernet controller may process 1885 Mb/s, with the memory read latency of an Ethernet channel being around 340 ns. When the FibreChannel RAM disk is plugged in, however, the FibreChannel RAM disk processes 752 MB/s of completions (the same as it normally does) while the Ethernet controller performs at 180 Mb/s. Here the memory read latency of the Ethernet channel is around 6200 ns. Thus, when both devices are on, the FibreChannel RAM disk interferes with the Ethernet controller even though the FibreChannel RAM disk's own performance was not affected. This is because the FibreChannel RAM disk initially fills the switch's buffer with completions at a ×8 rate, but then the upstream bandwidth drops to a ×4 rate, due to the switch's downstream link to the FibreChannel device being only ×4. Due to the congestion, the Ethernet controller takes much longer to get data back, as seen from the increased latency. Since the Ethernet device can have only 2 reads outstanding, a longer response time for those reads results in a major drop in performance.

The above example illustrates how the aggressive reading behavior of one device can dramatically and negatively affect another PCIe device. There is nothing forbidden about this configuration, and by themselves the devices each seem to perform quite well, making this a problem that a cursory analysis of the system would not reveal.

SUMMARY OF THE INVENTION

In one embodiment, a method for controlling reads in a computer input/output (I/O) interconnect is provided. A read request is received over the computer I/O interconnect from a first device, the request requesting data of a first size. Then it is determined whether fulfilling the read request would cause the total size of a completion queue to exceed a first predefined threshold. If fulfilling the read request would cause the total size of the completion queue to exceed the first predefined threshold, then the read request is temporarily restricted from being forwarded upstream.

In another embodiment, a read request is received over the computer I/O interconnect from a first device. Then it is determined if forwarding the read request upstream would cause the rate at which read requests are forwarded to exceed a drain rate of a completion queue by more than a predefined threshold. If forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed a drain rate of the completion queue by more than the predefined threshold, then the read request is temporarily restricted from being forwarded upstream.

In another embodiment, a system is provided comprising: an interface; and one or more components configured to: receive a read request over the computer I/O interconnect from a first device, the request requesting data of a first size; determine whether fulfilling the read request would cause the total size of a completion queue to exceed a first predefined threshold; and temporarily restrict the read request from being forwarded upstream if fulfilling the read request would cause the total size of the completion queue to exceed the first predefined threshold.

In another embodiment, a system is provided comprising: an interface; and one or more processors configured to: receive a read request over the computer I/O interconnect from a first device; determine if forwarding the read request upstream would cause the rate at which read requests are forwarded to exceed a drain rate of a completion queue by more than a predefined threshold; and temporarily restrict the read request from being forwarded upstream if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed a drain rate of the completion queue by more than the predefined threshold.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a system for controlling reads in a computer I/O interconnect in accordance with an embodiment of the present invention.

FIG. 2 is a flow diagram illustrating a method for controlling reads in a computer I/O interconnect in accordance with an embodiment of the present invention.

FIG. 3 is a flow diagram illustrating a method for controlling reads in a computer I/O interconnect in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well-known features may not have been described in detail to avoid unnecessarily obscuring the invention.

In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The present invention may also be tangibly embodied as a set of computer instructions stored on a computer readable medium, such as a memory device.

One solution to the congestion problem described in the background of the invention would be to tune the system to have one of the devices behave differently. For instance, in the example provided above, the FibreChannel RAM disk can be set such that the read rate or read size is reduced. This solution, however, requires anticipating the problem beforehand. It also requires knowledge of the drivers of the relevant endpoints/components. Many of these drivers may not be known without investigation, and such a solution would require constantly updating the system when new devices are attached.

In an embodiment of the present invention, a set of mechanisms is added to balance the rate of requests with the resulting data fulfilling the requests, reducing the maximum size of the destination queue and also ensuring that the destination bandwidth is not reduced to any source bandwidth. These mechanisms may be generically referred to as read pacing and read spacing.

The present invention may be applied to any protocol that permits the splitting of read requests and read completions. This includes, but is not limited to, PCIe, PCI-X, InfiniBand, RapidIO, and HyperTransport. Additionally, the present invention may be applied to any system or protocol that has been modified to permit the splitting of read requests and read completions. Therefore, while legacy PCI does not typically support the splitting of read requests and read completions, if a system running legacy PCI were modified to permit such splitting, the invention could be applied to it.

Read pacing is based on the idea that only so many requests need to be outstanding at a time in order to ensure uninterrupted completion; any extra read requests beyond that only cause queues to develop. A device with read pacing counts up, per source, how much data is requested in total. The counter may be labeled a "read count". Each additional request adds its read size to the read count. As the data is returned, the total read count is reduced according to the amount of data returned. Whenever the read count is larger than a threshold, subsequent requests from that source are held in the device and not forwarded to the final destination queue until the read count drops below the threshold again.
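
To make the bookkeeping concrete, the following is a minimal sketch in C of the per-source read pacing described above. It is illustrative only; the structure and function names (pacing_state, pacing_admit, pacing_complete) are assumptions of this sketch, not part of the specification, which leaves the enforcement mechanism open.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-source read pacing state. */
struct pacing_state {
    uint32_t read_count; /* bytes requested but not yet returned */
    uint32_t threshold;  /* pacing limit in bytes, e.g. 7 KB */
};

/* Returns true if the read may be forwarded upstream now; false
 * means it is held until enough completion data has drained. */
bool pacing_admit(struct pacing_state *s, uint32_t read_size)
{
    if (s->read_count + read_size > s->threshold)
        return false;            /* would exceed the threshold: hold */
    s->read_count += read_size;  /* count the newly outstanding data */
    return true;
}

/* Called as completion data for this source is returned. */
void pacing_complete(struct pacing_state *s, uint32_t bytes_returned)
{
    s->read_count -= bytes_returned;
}

A held request would simply be retried against pacing_admit as completions arrive and the read count falls back below the threshold.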

By placing a limit on the amount of data requested, the length of the final destination queue is similarly constrained. The limit is related to hardware resources on the device, such that all requested data can be stored on the device without overflowing device buffer space. In other words, the threshold is related to the size of a completion buffer and the typical round trip time from read to completion. If, for example, all ports are reading from 1 port (a typical host fanout application has all downstream ports read the main memory on the upstream port), then all completions arrive on one port (i.e., there is 1 destination buffer). If 4 ports are sharing the upstream completion buffer, then the threshold can only be ¼ as much as if there were only 1 aggressive reading device.

For purposes of this document, an aggressive reading device shall be interpreted to mean a device that sends out read requests in a manner that causes the latency between it and the data source to exceed the typical latency.

For example, if there is about 28 KB of space in the buffer available for the upstream completion queue for the upstream port, and there are 4 equally aggressive reading downstream ports, each port should get about ¼ of the buffer. Thus, the threshold for read pacing in this example may be set to approximately 7 KB.
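
Under the assumption that the buffer is split evenly among equally aggressive readers, the threshold arithmetic of this example reduces to a single division; the helper name below is hypothetical.

/* Hypothetical helper: split the upstream completion buffer evenly
 * among the aggressive reading ports. For the example above,
 * pacing_threshold(28 * 1024, 4) yields 7 KB per port. */
static uint32_t pacing_threshold(uint32_t completion_buffer_bytes,
                                 uint32_t num_aggressive_ports)
{
    return completion_buffer_bytes / num_aggressive_ports;
}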

It should be noted that there may be many different ways to enforce the threshold. One way, as described above, is to use a "read count" counter. Another way, however, would be to simply limit the size of the buffer so that it cannot possibly hold more data than the set threshold. In the above example, for instance, the buffer can simply be set with a size of 7 KB.

Turning now to read spacing, this mechanism addresses the case where multiple reads are sent closely together. If used together with read pacing, read spacing is only concerned with multiple reads when the threshold has not yet been exceeded. There is theoretically no need to send reads closer together than the data can be sent back. Therefore, by spreading out the read requests based on the rate at which the source can utilize the resulting data, no performance is lost and the queue in the destination buffer is kept minimal. It should be noted that in one embodiment, the read rate may be set higher than the data rate to account for times when the read request cannot be handled immediately by the destination. The higher read rate will build up a data buffer, up to the limit specified by read pacing, in order to smooth out completion data traffic.

In one embodiment of the present invention, the read spacing is set to allow the read rate to exceed the drain rate by no more than 2 times. However, this can be a programmable value. The reason to program it larger would be to fill an on-chip buffer more quickly, whereas a smaller value would fill it more slowly. If main memory is heavily congested, there are likely multiple downstream branches feeding into it, since the CPU typically wins all accesses to main memory over other devices' accesses. For example, if a root complex has 2 or more downstream ports, each having a PCI switch feeding yet even more downstream ports, and all downstream ports are trying to read the main memory simultaneously, then the memory controller may get overloaded.
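
A minimal sketch of read spacing in C follows, assuming a timestamp-based scheme in which each forwarded read schedules the earliest departure time of the next one. The names and the nanosecond clock are illustrative assumptions; a hardware implementation would more likely count link-clock cycles.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical read spacing state for one source. */
struct spacing_state {
    uint64_t next_ok_time_ns;   /* earliest time the next read may go */
    uint64_t drain_bytes_per_s; /* drain rate of the completion queue */
    uint32_t rate_factor;       /* programmable ratio, e.g. 2 */
};

/* Decide whether a read of read_size bytes may be forwarded at
 * time now_ns; if so, schedule the earliest time for the next one. */
bool spacing_admit(struct spacing_state *s, uint32_t read_size,
                   uint64_t now_ns)
{
    if (now_ns < s->next_ok_time_ns)
        return false; /* too soon: space the request out */

    /* Time the drain needs for this much data, divided by the
     * factor so the read rate may lead the drain rate by up to
     * rate_factor and prefill the on-chip buffer. */
    uint64_t interval_ns = (read_size * 1000000000ULL) /
                           (s->drain_bytes_per_s * s->rate_factor);
    s->next_ok_time_ns = now_ns + interval_ns;
    return true;
}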

The net effect of these mechanisms is to maintain destination bandwidth and reduce read request queue size in the memory controller, both of which will improve overall performance.

It should be noted that the term "read request queue" shall be interpreted to mean any queue that contains, or is designed to contain, read requests. Embodiments are possible where the queue also contains other requests or data. Such queues shall also be considered to be read request queues as long as they hold read requests.

The present invention may be implemented in various places in a computer I/O interconnect. For purposes of this document, a computer I/O interconnect shall be defined as a data transmission medium linking devices in a computer system. This may include, for example, a parallel multidrop bus, as is utilized in the PCI-X protocol. This may also include, for example, a point-to-point architecture, as is used in the PCI Express protocol.

FIG. 1 is a block diagram illustrating a system for controlling reads in a computer I/O interconnect in accordance with an embodiment of the present invention. Each component in the system may be embodied in hardware, software, or any combination thereof. In this diagram, there are five devices 100a-100e connected to an I/O interconnect system. Devices 100a-100c may be connected to a switch 102. In the PCI Express and other similar protocols, devices that initiate a read request may be known as endpoints.

It should be noted that while a single switch is depicted in FIG. 1, one of ordinary skill in the art will recognize that multiple switches may be utilized in a parallel, serial, or hierarchical configuration in order to accomplish the same goals. The switch 102 may include an upstream read request queue 104a, 104b, 104c corresponding to each of the ports connected to devices 100a-100c. The switch 102 may be connected to a root complex 106. Also connected to the root complex 106 are devices 100d-100e. Like the switch 102, the root complex 106 may also contain upstream read request queues 108a, 108b, 108c, here with queue 108a corresponding to the input from switch 102 and queues 108b and 108c corresponding to the inputs from devices 100d-100e. The root complex 106 controls a memory controller 110, which in turn may also house an upstream read request queue 112.

It should be noted that while read requests and read request queues are described in various portions of this specification, the present invention may also be applied to other types of requests and/or queues, and thus the claims are not to be limited to read requests or read request queues unless specifically stated.

Each upstream read request queue acts to hold incoming read requests until they can be acted upon by the device housing the queue. Once they are handled, they are placed in a downstream read request queue until they can be sent to another device.

The memory controller 110 may control main memory (not pictured). When a device 100a initiates a read request, the request may first pass to switch 102, where it is placed in upstream read request queue 104a. Once it has been acted upon by switch 102, it is placed in downstream read request queue 114 until it can be sent to root complex 106. Once it arrives at root complex 106, it is placed in upstream read request queue 108a. Once it has been acted upon by root complex 106, it is passed to memory controller 110, where it is placed in upstream read request queue 112. Once it has emerged from upstream read request queue 112, it is serviced and the appropriate completion response is formed from the information in memory.

This completion response may then be placed in completion queue 116. Once it has emerged from upstream completion queue 116, the completion response may be passed to root complex 106, where it is placed in an appropriate downstream completion queue (here, downstream completion queue 118a, which corresponds to the interconnect between the root complex 106 and switch 102, in contrast to downstream completion queues 118b and 118c, which correspond to the interconnects between the root complex 106 and devices 100d and 100e, respectively).

Once the completion has emerged from downstream completion queue 118a, it may be passed to switch 102, where it is placed in upstream completion queue 120. Once the switch 102 has finished with the completion, it may be placed in downstream completion queue 122a, which corresponds to the interconnect between the switch 102 and the device 100a (in contrast to the downstream completion queues 122b and 122c, which correspond to the interconnects between the switch 102 and devices 100b and 100c, respectively).

Various aspects of the present invention may be implemented at any of the upstream read request queues. For purposes of this document, the term "final destination read request queue" may be defined as the read request queue closest to the destination where the underlying data to respond to the read request resides. In FIG. 1, for example, the upstream read request queue 112 is the final destination read request queue.

FIG. 2 is a flow diagram illustrating a method for controlling reads in a computer I/O interconnect in accordance with an embodiment of the present invention. Each step of this method may be performed in software, hardware, or any combination thereof. If performed in software, the method may be implemented as computer-readable instructions stored in a program storage device. This method may be generally termed "read pacing." This method may be performed by one or more components in a computer system. One of those components may be a root complex of a PCIe system. Another component may be a switch. Another component may be a memory controller. At 200, a read request is received over the computer I/O interconnect from a first device, the request requesting data of a first size. At 202, it is determined whether fulfilling the read request would cause the total size of a completion queue to exceed a first predefined threshold. This determination may include adding the first size to a read counter and comparing the read counter to the first predefined threshold. The first predefined threshold may be set based on, for example, a size of memory available for the completion queue and a typical round trip time from read to completion from the upstream read request queue. This may include dividing the size of the memory available for the completion queue by the number of ports of the component controlling the upstream read request queue that are connected to an aggressive reading device.

If fulfilling the read request would cause the total size of the completion queue to exceed the first predefined threshold, then at 204 the read request is temporarily restricted from being forwarded upstream. If, on the other hand, fulfilling the read request would not cause the total size of the completion queue to exceed the first predefined threshold, then at 206 the read request may be forwarded upstream. Then at 208, the first size may be added to the read counter. At 210, once the read request is fulfilled, the first size may be subtracted from the read counter.

FIG. 3 is a flow diagram illustrating a method for controlling reads in a computer I/O interconnect in accordance with an embodiment of the present invention. Each step of this method may be performed in software, hardware, or any combination thereof. If performed in software, the method may be implemented as computer-readable instructions stored in a program storage device. This method may be performed by one or more components in a computer system. One of those components may be a root complex of a PCIe system. Another component may be a switch. Another component may be a memory controller. At 300, a read request is received over the computer I/O interconnect from a first device. At 302, it is determined if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed a drain rate of the completion queue by more than a second predefined threshold. This threshold may be expressed, for example, as a multiplication factor between the rate at which read requests are forwarded upstream and the drain rate of the completion queue. For example, the threshold may be set at two times the drain rate of the completion queue. If the rate at which read requests are forwarded upstream exceeds this, then the threshold has been breached. At 304, the read request is temporarily restricted from being forwarded upstream if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed the drain rate of the completion queue by more than the second predefined threshold.

It should be noted that while embodiments are foreseen wherein read pacing is performed without read spacing and vice-versa, in one embodiment of the present invention, both are performed. For example, the steps of FIG. 2 and FIG. 3 above may be combined into a single method, with the read spacing method being performed on read requests that would not cause the total size of a completion queue to exceed a first predefined threshold.
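
Combining the two mechanisms as described, pacing is checked first and spacing is applied only to requests that pacing would admit. A sketch, reusing the hypothetical state types and helpers from the earlier fragments:

/* Admit a read only if it passes read pacing and read spacing.
 * Types and helpers are from the illustrative sketches above. */
bool admit_read(struct pacing_state *p, struct spacing_state *sp,
                uint32_t read_size, uint64_t now_ns)
{
    if (p->read_count + read_size > p->threshold)
        return false;                     /* read pacing holds it */
    if (!spacing_admit(sp, read_size, now_ns))
        return false;                     /* read spacing delays it */
    p->read_count += read_size;           /* forwarded: count it */
    return true;
}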

While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. In addition, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.

CLAIMS

1. A method for controlling reads in a computer input/output (I/O) interconnect, the method comprising: receiving a read request over the computer I/O interconnect from a first device, the request requesting data of a first size; determining whether fulfilling the read request would cause the total size of a completion queue to exceed a first predefined threshold; and temporarily restricting the read request from being forwarded upstream if fulfilling the read request would cause the total size of the completion queue to exceed the first predefined threshold.

2. The method of claim 1, wherein the determining includes adding the first size to a read counter and comparing the read counter to the first predefined threshold.

3. The method of claim 2, further comprising, if fulfilling the read request would not cause the total size of the completion queue to exceed the first predefined threshold: forwarding the read request upstream; and adding the first size to the read counter.

4. The method of claim 3, further comprising: when the read request is fulfilled, subtracting the first size from the read counter.

5. The method of claim 1, wherein the method is performed in a root complex.

6. The method of claim 1, wherein the first predefined threshold is set based on a size of memory available for the completion queue and a typical round trip time from read to completion from the upstream read request queue.

7. The method of claim 1, wherein the method is performed in a switch.

8. The method of claim 7, wherein the first predefined threshold is determined by dividing the memory size available for the completion queue by the number of ports of the switch that are connected to aggressive reading devices.

9. The method of claim 1, wherein the method is performed in a memory controller.

10. The method of claim 1, further comprising: receiving a read request over the computer I/O interconnect from a second device; determining whether fulfilling the read request from the second device would cause the total size of the completion queue to exceed the first predefined threshold; if the read request from the second device would not cause the total size of the completion queue to exceed the first predefined threshold, determining if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed a drain rate of the completion queue by more than a second predefined threshold; and temporarily restricting the read request from the second device from being forwarded upstream if forwarding the read request from the second device upstream would cause the rate at which read requests are forwarded upstream to exceed the drain rate of the completion queue by more than the second predefined threshold.

11. The method of claim 10, wherein the second predefined threshold is expressed as a multiplication factor between the rate at which read requests are forwarded upstream and the drain rate of the completion queue.

12. A method for controlling reads in a computer I/O interconnect, the method comprising: receiving a read request over the computer I/O interconnect from a first device; determining if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed a drain rate of a completion queue by more than a predefined threshold; and temporarily restricting the read request from being forwarded upstream if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed the drain rate of the completion queue by more than the predefined threshold.

13. The method of claim 12, wherein the predefined threshold is expressed as a multiplication factor between the rate at which read requests are forwarded upstream and the drain rate of the completion queue.

14. A system comprising: an interface; and one or more components configured to: receive a read request over a computer I/O interconnect from a first device, the request requesting data of a first size; determine whether fulfilling the read request would cause the total size of a completion queue to exceed a first predefined threshold; and temporarily restrict the read request from being forwarded upstream if fulfilling the read request would cause the total size of the completion queue to exceed the first predefined threshold.

15. The system of claim 14, further comprising: a switch; a root complex coupled to the switch; a memory controller coupled to the root complex; and a memory coupled to the memory controller.

16. The system of claim 15, wherein the one or more components are located in the memory controller.

17. The system of claim 15, wherein the one or more components are located in the root complex.

18. The system of claim 15, wherein the one or more components are located in the switch.

19. A system comprising: an interface; and one or more processors configured to: receive a read request over a computer I/O interconnect from a first device; determine if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed a drain rate of a completion queue by more than a predefined threshold; and temporarily restrict the read request from being forwarded upstream if forwarding the read request upstream would cause the rate at which read requests are forwarded upstream to exceed the drain rate of the completion queue by more than the predefined threshold.